[Zen2] Introduce ClusterBootstrapService #35488
Conversation
Today, the bootstrapping of a Zen2 cluster is driven externally, requiring something else to wait for discovery to converge and then to inject the initial configuration. This is hard to use in some situations, such as REST tests. This change introduces the `ClusterBootstrapService` which brings the bootstrap retry logic within each node and allows it to be controlled via an (unsafe) node setting.
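To make the shape of the change concrete, here is a minimal, hypothetical sketch of in-node bootstrap retry. The names (BootstrapRetrySketch, attemptBootstrap, RETRY_DELAY_SECONDS) are illustrative only and not taken from the PR; the real logic lives in ClusterBootstrapService.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: the node keeps retrying the bootstrap attempt itself
// instead of relying on an external driver to wait for discovery to converge.
public class BootstrapRetrySketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private static final long RETRY_DELAY_SECONDS = 10; // assumed retry interval

    public void startBootstrap() {
        scheduler.execute(this::attemptBootstrap);
    }

    private void attemptBootstrap() {
        try {
            if (tryInjectInitialConfiguration()) {
                scheduler.shutdown(); // bootstrapped; no further attempts needed
                return;
            }
        } catch (Exception e) {
            // e.g. discovery has not converged yet; fall through and retry
        }
        // retry indefinitely rather than give up, as discussed below
        scheduler.schedule(this::attemptBootstrap, RETRY_DELAY_SECONDS, TimeUnit.SECONDS);
    }

    // Stub standing in for the real bootstrap call; returns true once it succeeds.
    private boolean tryInjectInitialConfiguration() {
        return false;
    }
}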
Pinging @elastic/es-distributed
Left a few small comments, looks good otherwise.
@Override
public void handleException(TransportException exp) {
    if (exp.getRootCause() instanceof ElasticsearchTimeoutException) {
should we simply set the timeout to a very high value instead of adding this retry logic here?
How high is high enough? I'd prefer to retry forever rather than have to remember that this might time out and stop in a future debugging session.
We can also make the timeout optional, so that setting it to null makes it unbounded (i.e. does not schedule a timeout)
Ok, introduced nullability to the timeout in 9f7e951.
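For readers following the thread, a minimal sketch of the nullable-timeout idea, with hypothetical names (the actual change is in 9f7e951): only schedule a timeout task when a timeout was actually supplied.

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: a null timeout means "unbounded", i.e. no
// timeout task is ever scheduled and the request may retry indefinitely.
final class TimeoutSketch {
    static void sendWithOptionalTimeout(ScheduledExecutorService scheduler,
                                        Runnable onTimeout,
                                        Long timeoutMillis) { // null = unbounded
        if (timeoutMillis != null) {
            // bounded: fail the request if no response arrives in time
            scheduler.schedule(onTimeout, timeoutMillis, TimeUnit.MILLISECONDS);
        }
        // if timeoutMillis is null, nothing is scheduled here
    }
}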
new TransportResponseHandler<BootstrapClusterResponse>() {
    @Override
    public void handleResponse(BootstrapClusterResponse response) {
        logger.debug("bootstrapped successful: received {}", response);
successfully?
Fixed the message in 12d90d2 - I meant "bootstrapping" not "bootstrapped".
for (int i = 0; i < numSharedDedicatedMasterNodes; i++) {
    final Settings.Builder settings = Settings.builder();
    settings.put(Node.NODE_MASTER_SETTING.getKey(), true);
    settings.put(Node.NODE_DATA_SETTING.getKey(), false);
    if (bootstrapNodeRequired) {
        settings.put(INITIAL_MASTER_NODE_COUNT_SETTING.getKey(), numSharedDedicatedMasterNodes);
        bootstrapNodeRequired = false;
why only run the bootstrapping on a single node?
It's sufficient and more obviously correct, but either way is fine by me. Running it on multiple nodes requires validateClusterFormed() before the next node starts up, and this isn't especially clear. Your call.
We discussed this on another channel, and decided to only do the auto-bootstrapping when autoManageMinMaster mode is active. We also decided to have multiple nodes participate in / run the bootstrapping process.
Done in 150f47b.
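For illustration, the multi-node variant amounts to dropping the bootstrapNodeRequired guard from the loop above, so every dedicated master node carries the setting. This is a hypothetical sketch, not the literal 150f47b diff:

for (int i = 0; i < numSharedDedicatedMasterNodes; i++) {
    final Settings.Builder settings = Settings.builder();
    settings.put(Node.NODE_MASTER_SETTING.getKey(), true);
    settings.put(Node.NODE_DATA_SETTING.getKey(), false);
    // every master-eligible node now participates in bootstrapping,
    // so the single-node bootstrapNodeRequired flag is no longer needed
    settings.put(INITIAL_MASTER_NODE_COUNT_SETTING.getKey(), numSharedDedicatedMasterNodes);
    // ... build and start the node with these settings ...
}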
        (int) Stream.of(settings).filter(Node.NODE_MASTER_SETTING::get).count())
    .put(nodeSettings)
    .build();
bootstrapNodeRequired = false;
same here, why only bootstrap one node?
Done in 150f47b.
LGTM
@@ -1081,6 +1081,9 @@ private synchronized void reset(boolean wipeData) throws IOException {
    final Settings.Builder settings = Settings.builder();
    settings.put(Node.NODE_MASTER_SETTING.getKey(), true);
    settings.put(Node.NODE_DATA_SETTING.getKey(), false);
    if (prevNodeCount == 0 && autoManageMinMasterNodes) {
        settings.put(INITIAL_MASTER_NODE_COUNT_SETTING.getKey(), numSharedDedicatedMasterNodes + numSharedDataNodes);
I think this needs to be

    settings.put(INITIAL_MASTER_NODE_COUNT_SETTING.getKey(), numSharedDedicatedMasterNodes +
        (numSharedDedicatedMasterNodes > 0 ? 0 : numSharedDataNodes));

because of the condition further below where we make the shared data nodes data-only nodes when there are dedicated master nodes.
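To make the off-by-count concrete: with, say, 2 shared dedicated master nodes and 3 shared data nodes, the data nodes become data-only, so the initial master node count should be 2; the expression as written would produce 5.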